Text data is unstructured. But if you want to extract information from text, then you often need to process that data into a more structured representation. The common idea behind all Natural Language Processing (NLP) tools is that they try to structure or transform text in some meaningful way. You have already learned about four basic NLP steps: sentence splitting, tokenization, POS-tagging and lemmatization. For all of these, we have used the NLTK library, which is widely used in the field of NLP. However, there are some alternatives that are worth a look. One of them is spaCy, which is fast and accurate and supports multiple languages.
At the end of this chapter, you will be able to:
There are many tools and libraries designed to solve NLP problems. In Chapter 15, we have already seen the NLTK library for tokenization, sentence splitting, part-of-speech tagging and lemmatization. However, there are many more NLP tasks and off-the-shelf tools to perform them. These tasks often depend on each other and are therefore put into a sequence; such a sequence of NLP tasks is called an NLP pipeline. Some of the most common NLP tasks are:
You don't always need all these modules. But it's important to know that they are there, so that you can use them when the need arises.
Let's be clear about one thing: you don't always need to use Python for NLP. There are some very strong NLP programs out there that don't rely on Python. You can typically call these programs from the command line. Some examples are:
TreeTagger is a POS-tagger and lemmatizer in one. It provides support for many different languages. If you want to call TreeTagger from Python, use treetaggerwrapper (a minimal usage sketch follows below). Treetagger-python also works, but is much slower.
Stanford's CoreNLP is a very powerful system that is able to process English, German, Spanish, French, Chinese and Arabic. (Each to a different extent, though; the pipeline for English is the most complete.) There are also Python wrappers available, such as py-corenlp.
The MaltParser has models for English, Swedish, French, and Spanish.
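As an illustration of calling one of these tools from Python, here is a minimal sketch using treetaggerwrapper. It assumes that TreeTagger itself is installed and that the wrapper can locate it (you may need to pass TAGDIR with the path to your TreeTagger installation):
In [ ]:
import treetaggerwrapper

# Build a tagger for English; pass TAGDIR='/path/to/treetagger' if the wrapper cannot find your installation
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')

# tag_text returns a list of 'word<TAB>pos<TAB>lemma' strings
tags = tagger.tag_text("I have an awesome cat.")
print(tags)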
Having said that, there are many NLP tools that have been developed for Python:
spaCy provides a rather complete NLP pipeline: it takes a raw document and performs tokenization, POS-tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER). It also supports similarity prediction, but that is outside of the scope of this notebook. The advantage of spaCy is that it is really fast and has good accuracy. In addition, it currently supports multiple languages, including English, German, Spanish, Portuguese, French, Italian and Dutch.
In this notebook, we will show you the basic usage. If you want to learn more, please visit spaCy's website; it has extensive documentation and provides excellent user guides.
To install spaCy, check out the instructions here. On this page, it is explained exactly how to install spaCy for your operating system, package manager and desired language model(s). Simply run the suggested commands in your terminal or cmd. Alternatively, you can probably also just run the following cells in this notebook:
In [ ]:
%%bash
conda install -c conda-forge spacy
In [ ]:
%%bash
python -m spacy download en
Now, let's first load spaCy. We import the spaCy module and load the English tokenizer, tagger, parser, NER and word vectors.
In [ ]:
import spacy
nlp = spacy.load('en') # other languages: de, es, pt, fr, it, nl
nlp is now a Python object representing the English NLP pipeline that we can use to process a text.
For English, there are three models ranging from 'small' to 'large': en_core_web_sm, en_core_web_md and en_core_web_lg.
By default, the smallest one is loaded. Larger models should have better accuracy, but take longer to load. If you like, you can use them instead; you will first need to download them.
In [ ]:
#%%bash
#python -m spacy download en_core_web_md
In [ ]:
#%%bash
#python -m spacy download en_core_web_lg
In [ ]:
# uncomment one of the lines below if you want to load the medium or large model instead of the small one
#nlp = spacy.load('en_core_web_md')
#nlp = spacy.load('en_core_web_lg')
In [ ]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")
doc is now a Python object of the class Doc. It is a container for accessing linguistic annotations and a sequence of Token objects.
At this point, there are three important types of objects to remember:
A Doc is a sequence of Token objects.
A Token object represents an individual token, i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations.
A Span object is a slice from a Doc object and a sequence of Token objects.
Since a Doc is a sequence of Token objects, we can iterate over all of the tokens in the text as shown below, or select a single token from the sequence:
In [ ]:
# Iterate over the tokens
for token in doc:
    print(token)
print()

# Select one single token by index
first_token = doc[0]
print("First token:", first_token)
Please note that even though these look like strings, they are not:
In [ ]:
for token in doc:
    print(token, "\t", type(token))
These Token objects have many useful methods and attributes, which we can list by using dir(). We haven't really talked about attributes during this course, but while methods are operations or activities performed by an object, attributes are 'static' features of an object. Methods are called using parentheses (as we have seen with str.upper(), for instance), while attributes are indicated without parentheses. We will see some examples below.
You can find more detailed information about the token methods and attributes in the documentation.
In [ ]:
dir(first_token)
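To make the difference between attributes and methods concrete, here is a small illustration using the first_token defined above: text is one of the Token attributes, while nbor() is one of the Token methods listed by dir() (it returns a neighbouring token).
In [ ]:
# An attribute is accessed without parentheses
print(first_token.text)

# A method is called with parentheses; nbor() returns the neighbouring token
print(first_token.nbor())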
Let's inspect some of the attributes of the tokens. Can you figure out what they mean? Feel free to try out a few more.
In [ ]:
# Print attributes of tokens
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)
Notice that some of the attributes end with an underscore. For example, tokens have both lemma and lemma_ attributes. The lemma attribute represents the id of the lemma (an integer), while the lemma_ attribute represents the unicode string representation of the lemma. In practice, you will mostly use the lemma_ attribute.
In [ ]:
for token in doc:
    print(token.lemma, token.lemma_)
You can also use spacy.explain to find out more about certain labels:
In [ ]:
# try out some more, such as NN, ADP, PRP, VBD, VBP, VBZ, WDT, aux, nsubj, pobj, dobj, npadvmod
spacy.explain("VBZ")
You can create a Span object from the slice doc[start : end]. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.
In [ ]:
# Create a Span
a_slice = doc[2:5]
print(a_slice, type(a_slice))

# Iterate over Span
for token in a_slice:
    print(token.lemma_, token.pos_)
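As a quick illustration of the negative indices and open-ended ranges mentioned above, here is a minimal sketch using the same doc:
In [ ]:
# Open-ended and negative slices also produce Span objects
last_three = doc[-3:]    # the last three tokens
from_fourth = doc[3:]    # from the fourth token to the end of the document
print(last_three, type(last_three))
print(from_fourth, type(from_fourth))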
If you call the dir() function on a Doc object, you will see that it has a range of methods and attributes. You can read more about them in the documentation. Below, we highlight three of them: text, sents and noun_chunks.
In [ ]:
dir(doc)
First of all, text simply gives you the whole document as a string:
In [ ]:
print(doc.text)
print(type(doc.text))
sents can be used to get all the sentences. Notice that it will create a so-called 'generator'. For now, you don't have to understand exactly what a generator is (if you like, you can read more about them online). Just remember that we can use generators to iterate over an object in a fast and efficient way.
In [ ]:
# Get all the sentences as a generator
print(doc.sents, type(doc.sents))

# We can use the generator to loop over the sentences; each sentence is a span of tokens
for sentence in doc.sents:
    print(sentence, type(sentence))
If you find this difficult to comprehend, you can also simply convert it to a list and then loop over the list. Remember that this is less efficient, though.
In [ ]:
# You can also store the sentences in a list and then loop over the list
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence, type(sentence))
The benefit of converting it to a list is that we can use indices to select certain sentences. For example, in the following we only print some information about the tokens in the second sentence.
In [ ]:
# Print some information about the tokens in the second sentence.
sentences = list(doc.sents)
for token in sentences[1]:
    data = '\t'.join([token.orth_,
                      token.lemma_,
                      token.pos_,
                      token.tag_,
                      str(token.i),    # Turn index into string
                      str(token.idx)]) # Turn index into string
    print(data)
Similarly, noun_chunks can be used to create a generator for all noun chunks in the text.
In [ ]:
# Get all the noun chunks as a generator
print(doc.noun_chunks, type(doc.noun_chunks))

# You can loop over a generator; each noun chunk is a span of tokens
for chunk in doc.noun_chunks:
    print(chunk, type(chunk))
    print()
In [ ]:
# Here's a slightly longer text, from the Wikipedia page about Harry Potter.
harry_potter = "Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley,\
all of whom are students at Hogwarts School of Witchcraft and Wizardry.\
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal,\
overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."
doc = nlp(harry_potter)
print(doc.ents)
print(type(doc.ents))
In [ ]:
# Each entity is a span of tokens and is labeled with the type of entity
for entity in doc.ents:
    print(entity, "\t", entity.label_, "\t", type(entity))
Pretty cool, but what does NORP mean? Again, you can use spacy.explain() to find out:
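In [ ]:
# find out what the entity label 'NORP' means
spacy.explain("NORP")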
Another very popular NLP pipeline is Stanford CoreNLP. You can use the tool from the command line, but there are also some useful Python wrappers that make use of the Stanford CoreNLP API, such as pycorenlp. As you might want to use this in the future, we will provide you with a quick start guide. To use the code below, you will have to do the following:
Install the Python wrapper (run pip install pycorenlp in your terminal, or simply run the cell below).
Download Stanford CoreNLP and start the CoreNLP server by running the following commands in a terminal (adjust the path to where you unpacked CoreNLP):
cd LOCATION_OF_CORENLP/stanford-corenlp-full-2018-02-27
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
In [ ]:
#%%bash
#pip install pycorenlp
In [ ]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
Next, you will want to define which annotators to use and which output format should be produced (text, json, xml, conll, conllu, serialized). Annotating the document is then very easy. Note that Stanford CoreNLP uses some large models that can take a long time to load. You can read more about it here.
In [ ]:
harry_potter = "Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley,\
all of whom are students at Hogwarts School of Witchcraft and Wizardry.\
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal,\
overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."
# Define annotators and output format
properties = {'annotators': 'tokenize, ssplit, pos, lemma, parse',
              'outputFormat': 'json'}
# Annotate the string with CoreNLP
doc = nlp.annotate(harry_potter, properties=properties)
In the next cells, we will simply show some examples of how to access the linguistic annotations if you use the properties as shown above. If you'd like to continue working with Stanford CoreNLP in the future, you will likely have to experiment a bit more.
In [ ]:
doc.keys()
In [ ]:
sentences = doc["sentences"]
first_sentence = sentences[0]
first_sentence.keys()
In [ ]:
first_sentence["parse"]
In [ ]:
first_sentence["basicDependencies"]
In [ ]:
first_sentence["tokens"]
In [ ]:
for sent in doc["sentences"]:
    for token in sent["tokens"]:
        word = token["word"]
        lemma = token["lemma"]
        pos = token["pos"]
        print(word, lemma, pos)
There might be different reasons why you want to use NLTK, spaCy or Stanford CoreNLP: there are differences in efficiency, quality, user friendliness, functionality, output formats, etc. At this moment, we advise you to go with spaCy because of its ease of use and high-quality performance.
Here's an example of both NLTK and spaCy in action.
In [ ]:
import nltk
import spacy
nlp = spacy.load('en')
In [ ]:
text = "I like cheese very much"
print("NLTK results:")
nltk_tagged = nltk.pos_tag(text.split())
print(nltk_tagged)
print()
print("spaCy results:")
doc = nlp(text)
spacy_tagged = []
for token in doc:
    tag_data = (token.orth_, token.tag_)
    spacy_tagged.append(tag_data)
print(spacy_tagged)
Do you want to learn more about the differences between NLTK, spaCy and CoreNLP? Here are some links:
Data is often messy, noisy or includes irrelevant information. Therefore, chances are that you will need to do some cleaning before you can start with your analysis. This is especially true for social media texts, such as tweets, chats, and emails. Typically, these texts are informal and notoriously noisy. Normalizing them so that they can be processed with NLP tools is an NLP challenge in itself, and fully discussing it goes beyond the scope of this course. However, you may find the following modules useful in your project:
If you are interested in reading more about these topics, these papers discuss preprocessing and normalization:
And here is a nice blog about character encoding.
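To give a rough idea of what such cleaning can look like, here is a minimal sketch that uses only the standard re module. The clean_tweet function below is just an illustration (not one of the modules referred to above); dedicated normalization tools handle many more phenomena, such as emoticons, spelling variation and encoding issues:
In [ ]:
import re

def clean_tweet(text):
    """A very rough cleaning function for tweet-like text (illustration only)."""
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"[@#]\w+", "", text)       # remove user mentions and hashtags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(clean_tweet("@user OMG check this out!! https://example.com/xyz #nlp"))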
In [ ]:
import spacy
nlp = spacy.load('en')
In [ ]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")
for token in doc:
print(token.pos_, token.tag_)
In [ ]:
spacy.explain("PRON")
Let's practice a bit with processing files. Open the file charlie.txt for reading and use read() to read its content as a string. Then use spaCy to annotate this string and print the information below. Remember: you can use dir() to remind yourself of the attributes.
For each token in the text:
For each sentence in the text:
For each noun chunk in the text:
For each named entity in the text:
In [ ]:
filename = "../Data/Charlie/charlie.txt"
# read the file and process with spaCy
In [ ]:
# print all information about the tokens
In [ ]:
# print all information about the sentences
In [ ]:
# print all information about the noun chunks
In [ ]:
# print all information about the entities
In [ ]:
import glob
filenames = glob.glob("../Data/dreams/*.txt")
print(filenames)
Now create a function called get_vocabulary that takes one positional parameter filenames. It should read in all filenames and return a set called unique_words that contains all unique words in the files.
In [ ]:
def get_vocabulary(filenames):
    # your code here

# test your function here
unique_words = get_vocabulary(filenames)
print(unique_words, len(unique_words))
assert len(unique_words) == 415 # if your code is correct, this should not raise an error
Create a function called get_sentences_with_keyword that takes one positional parameter filenames and one keyword parameter keyword with default value None. It should read in all filenames and return a list called sentences that contains all sentences (the complete sentence texts) containing the keyword.
Hints:
In [ ]:
import glob
filenames = glob.glob("../Data/dreams/*.txt")
print(filenames)
In [ ]:
def get_sentences_with_keyword(filenames, keyword=None):
    # your code here

# test your function here
sentences = get_sentences_with_keyword(filenames, keyword="toy")
print(sentences)
assert len(sentences) == 4 # if your code is correct, this should not raise an error